Visual exploration of ploidy levels - Simulation data

Introduction

PloidyNGS was developed for a visual exploration of genome ploidy levels using Next Generation Sequencing data.

The main program is explorePloidyNGS.py, which plots the distribution of allele proportions across heterozygous positions in the genome, assuming that these positions are biallelic. The underlying assumption is that the distribution of allele proportions at these positions reflects the ploidy level of the organism under study. For example, in a diploid genome most heterozygous positions are expected to show 50% of each allele, whereas proportions of 33.3% and 66.7% are expected in a triploid genome.
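The expected peak positions follow directly from the ploidy: at a biallelic heterozygous position in a genome of ploidy n, one allele can be present in k of the n copies, with k ranging from 1 to n-1. The short sketch below only illustrates this arithmetic; the function name is hypothetical and is not part of PloidyNGS.

```python
def expected_allele_proportions(ploidy):
    """Expected percentages of one allele at biallelic heterozygous positions.

    Illustrative helper (not part of PloidyNGS): an allele present in k of
    `ploidy` chromosome copies is expected at 100 * k / ploidy percent.
    """
    if ploidy < 2:
        raise ValueError("heterozygous positions require a ploidy of at least 2")
    return [round(100.0 * k / ploidy, 1) for k in range(1, ploidy)]


for n in (2, 3, 4):
    print(n, expected_allele_proportions(n))
```

For ploidy 2 this gives a single peak at 50%, and for ploidy 3 two peaks at 33.3% and 66.7%, matching the example in the text.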

The software was tested with simulated and real data from different strains of the budding yeast Saccharomyces cerevisiae, using the haploid genome of S. cerevisiae s288c as reference.

Requirements

Usage

$ explorePloidyNGS.py --out outTable.txt --bam map_masked_genome.sorted.bam --genome masked_genome.fasta

Inputs:

* The mapping step that produces the sorted BAM file (--bam) can be carried out using e.g. Bowtie2.

** The masked genome sequence (--genome) can be obtained using e.g. RepeatMasker.

explorePloidyNGS.py outputs a table (outTable.txt in the example above) with the percentage of each observed allele (A, T, C or G) at each position, and a graphic (see the simulated data below) for visually exploring ploidy: peaks in the distribution of observed allele proportions at heterozygous positions indicate the ploidy level.
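As a rough illustration of what such a per-position percentage means, the sketch below converts base counts at one position into allele percentages and checks whether exactly two alleles were observed. It is an assumption-laden toy: the function names and the input format (a dictionary of base counts) are hypothetical, and it does not model sequencing errors or any filtering that explorePloidyNGS.py may apply.

```python
def allele_percentages(base_counts):
    """Percentage of each observed base at one position.

    `base_counts` is a hypothetical input such as {"A": 10, "T": 10, "C": 0, "G": 0}.
    Bases that were not observed are left out of the result.
    """
    total = sum(base_counts.values())
    if total == 0:
        return {}
    return {base: 100.0 * n / total for base, n in base_counts.items() if n > 0}


def is_biallelic_heterozygous(base_counts):
    """True when exactly two different bases were observed at the position."""
    return sum(1 for n in base_counts.values() if n > 0) == 2


counts = {"A": 10, "T": 10, "C": 0, "G": 0}
print(allele_percentages(counts), is_biallelic_heterozygous(counts))
```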

Check the documentation (README) for other options.

Simulated Data

The program simulatePloidyData.py was developed to generate simulated genomes with different ploidy and heterozygosity levels.

Here is an example of how the software is used:

$ simulatePloidyData.py --genome genome.fna --ploidy 3 --heterozygosity 0.01

genome.fna is an unmasked haploid genome sequence, such as that of Saccharomyces cerevisiae s288c.

--ploidy sets the ploidy level of the simulated genome. A value of 3 means that three copies of each input chromosome are generated.

--heterozygosity sets the heterozygosity level of the simulated genome. A value of 0.1 means that a heterozygous position is expected approximately every 10 bases, while 0.001 means approximately one every 1000 bases.
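The relationship between the heterozygosity value and the spacing of heterozygous positions is simple arithmetic, sketched below. The function names are hypothetical, and the genome length of 12,000,000 bases is only a round figure chosen for illustration (roughly the size of the S. cerevisiae genome).

```python
def expected_spacing(heterozygosity):
    """Average number of bases between heterozygous positions."""
    return 1.0 / heterozygosity


def expected_het_positions(genome_length, heterozygosity):
    """Expected number of heterozygous positions in a genome of the given length."""
    return genome_length * heterozygosity


GENOME_LENGTH = 12_000_000  # illustrative round figure, not taken from the software

for h in (0.1, 0.01, 0.001, 0.0001):
    print(h, round(expected_spacing(h)), round(expected_het_positions(GENOME_LENGTH, h)))
```

With the value 0.01 used in the command above, a heterozygous position is expected about every 100 bases.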

Description of generation of simulated genomes

For a given ploidy level, only certain allele dosages are possible. First, the software takes the user-defined ploidy and generates a table indexed with all possible biallelic dosages. Next, it creates a second table (a hash) to store the allele dosage of each heterozygous position. It then walks through each genomic position, randomly designating heterozygous positions according to the user-defined heterozygosity level (a probability given as a command-line argument), so that one occurs approximately every 1/heterozygosity bases.
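The steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the simulation script: the function names, the representation of a dosage as a pair (copies of the original allele, copies of the second allele), and the fixed random seed are all hypothetical choices.

```python
import random


def dosage_table(ploidy):
    """Index every possible biallelic dosage for the given ploidy.

    Each dosage is a pair (copies of original allele, copies of second allele).
    """
    return {index: (k, ploidy - k) for index, k in enumerate(range(1, ploidy))}


def pick_het_positions(seq_length, heterozygosity, rng):
    """Walk through every position; each one is heterozygous with the given probability."""
    return [pos for pos in range(seq_length) if rng.random() < heterozygosity]


rng = random.Random(42)  # fixed seed only so that the sketch is reproducible
table = dosage_table(4)
positions = pick_het_positions(10_000, 0.01, rng)

# Second table (hash): heterozygous position -> index of a randomly chosen dosage.
het_dosage = {pos: rng.choice(list(table)) for pos in positions}

print(table)
print(len(het_dosage), "heterozygous positions selected")
```

For a tetraploid genome the dosage table holds three entries, corresponding to 1/3, 2/2 and 3/1 copies of the two alleles.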

For each heterozygous position, the software takes a randomly selected biallelic dosage from the dosage table (by its index), the original allele at that position, and a second, randomly selected allele (any DNA nucleotide different from the original allele). At this step, a list with the characters to write in each final chromosome copy is stored in memory. For example, for a heterozygous position with A as the original allele in a tetraploid genome, the list would hold either A,T,T,T (1/3), A,A,T,T (2/2) or A,A,A,T (3/1), to be assigned to that position in chromosome copies 1, 2, 3, and 4, respectively. Note that either of the other two nucleotides (C or G) could equally have been chosen as the second allele instead of T.
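A minimal sketch of this step is given below, assuming a dosage is represented as a pair (copies of the original allele, copies of the second allele). The function name and this representation are hypothetical and serve only to illustrate the description above.

```python
import random

NUCLEOTIDES = "ACGT"


def alleles_for_position(original, dosage, rng):
    """Build the list of alleles for one heterozygous position, one entry per chromosome copy.

    `dosage` is a pair (copies of the original allele, copies of the second allele).
    The second allele is drawn at random from the three other nucleotides.
    """
    n_original, n_second = dosage
    second = rng.choice([base for base in NUCLEOTIDES if base != original])
    return [original] * n_original + [second] * n_second


rng = random.Random(0)  # fixed seed only so that the sketch is reproducible
for dosage in ((1, 3), (2, 2), (3, 1)):
    print(dosage, alleles_for_position("A", dosage, rng))
```

Each printed list has four entries, one per chromosome copy of the tetraploid genome, with the original allele listed first.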

The final stage is the generation of all copies of the chromosome set for the user-defined ploidy. For each copy, the sequence is written exactly as in the original haploid genome provided by the user, except at heterozygous positions. At these positions, the allele written to a given chromosome copy is taken from the previously assigned list of alleles for that position, using the copy number as index. A multi-FASTA file is generated for each simulated genome, with a number in each FASTA header identifying the chromosome copy. For example, a tetraploid genome will contain four FASTA sequences for each chromosome, with copy numbers 1, 2, 3, and 4. This multi-FASTA file is then used to simulate Illumina paired-end reads at different coverages.
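The writing stage might look like the sketch below. It is an illustration only: the function name, the layout of the input dictionaries, and the header format (chromosome name followed by "_copy" and the copy number) are assumptions, since the exact header format is not specified here.

```python
import io


def write_polyploid_fasta(chromosomes, het_alleles, ploidy, handle, width=60):
    """Write `ploidy` copies of every chromosome to a multi-FASTA handle.

    chromosomes: {name: haploid sequence}
    het_alleles: {name: {0-based position: [allele for copy 1, copy 2, ...]}}
    """
    for name, seq in chromosomes.items():
        site_alleles = het_alleles.get(name, {})
        for copy_index in range(ploidy):
            bases = list(seq)
            # Only heterozygous positions differ from the haploid input.
            for pos, alleles in site_alleles.items():
                bases[pos] = alleles[copy_index]
            handle.write(f">{name}_copy{copy_index + 1}\n")  # hypothetical header format
            for start in range(0, len(bases), width):
                handle.write("".join(bases[start:start + width]) + "\n")


chromosomes = {"chrI": "ACGTACGT"}
het_alleles = {"chrI": {2: ["G", "G", "T"]}}  # dosage 2/1 at position 2 of a triploid
out = io.StringIO()
write_polyploid_fasta(chromosomes, het_alleles, 3, out)
print(out.getvalue())
```

In this toy triploid example, copies 1 and 2 keep the original G at the heterozygous position and copy 3 carries the second allele T.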

The table below shows links to results of explorePloidyNGS for simulated Illumina paired-end reads generated using ART and mapped to the haploid masked genome of S. cerevisiae s288c using Bowtie2. Reads were simulated for genomes with varying ploidy (di-, tri-, tetra-, penta-, hexa-, and heptaploid) and heterozygosity (from 0.1 to 0.0001) levels, at different sequencing coverage depths (15x, 25x, 50x, and 100x).

Notice that the visual resolution of allele proportions (and consequently the capacity to estimate ploidy levels) improves with higher heterozygosity level and sequencing coverage depth.
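One way to see why coverage matters is an idealized model in which every read covering a position samples one of the alleles independently, so that the observed allele proportion follows a binomial distribution. This model is an assumption made here for illustration, not something the software computes, and the function name is hypothetical. Under it, the spread of the observed proportion around its expected value shrinks with the square root of the coverage.

```python
import math


def proportion_sd(p, coverage):
    """Standard deviation of the observed allele proportion under binomial sampling.

    p: expected proportion of the allele (e.g. 1/3 in a triploid)
    coverage: number of reads covering the position
    """
    return math.sqrt(p * (1.0 - p) / coverage)


for coverage in (15, 25, 50, 100):
    sd = proportion_sd(1 / 3, coverage)
    print(f"{coverage}x: sd of observed proportion = {sd:.3f}")
```

The narrower the spread relative to the distance between expected peaks (one third for a triploid, less for higher ploidies), the easier the peaks are to tell apart. A higher heterozygosity level contributes more positions to the distribution, which makes the peaks smoother.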

Ploidy level  Seq. coverage  Heterozygosity level
                             0.0001                   0.001                    0.01                     0.1
2             15x            Ploidy2_Heter0.0001_15   Ploidy2_Heter0.001_15    Ploidy2_Heter0.01_15     Ploidy2_Heter0.1_15
2             25x            Ploidy2_Heter0.0001_25   Ploidy2_Heter0.001_25    Ploidy2_Heter0.01_25     Ploidy2_Heter0.1_25
2             50x            Ploidy2_Heter0.0001_50   Ploidy2_Heter0.001_50    Ploidy2_Heter0.01_50     Ploidy2_Heter0.1_50
2             100x           Ploidy2_Heter0.0001_100  Ploidy2_Heter0.001_100   Ploidy2_Heter0.01_100    Ploidy2_Heter0.1_100
3             15x            Ploidy3_Heter0.0001_15   Ploidy3_Heter0.001_15    Ploidy3_Heter0.01_15     Ploidy3_Heter0.1_15
3             25x            Ploidy3_Heter0.0001_25   Ploidy3_Heter0.001_25    Ploidy3_Heter0.01_25     Ploidy3_Heter0.1_25
3             50x            Ploidy3_Heter0.0001_50   Ploidy3_Heter0.001_50    Ploidy3_Heter0.01_50     Ploidy3_Heter0.1_50
3             100x           Ploidy3_Heter0.0001_100  Ploidy3_Heter0.001_100   Ploidy3_Heter0.01_100    Ploidy3_Heter0.1_100
4             15x            Ploidy4_Heter0.0001_15   Ploidy4_Heter0.001_15    Ploidy4_Heter0.01_15     Ploidy4_Heter0.1_15
4             25x            Ploidy4_Heter0.0001_25   Ploidy4_Heter0.001_25    Ploidy4_Heter0.01_25     Ploidy4_Heter0.1_25
4             50x            Ploidy4_Heter0.0001_50   Ploidy4_Heter0.001_50    Ploidy4_Heter0.01_50     Ploidy4_Heter0.1_50
4             100x           Ploidy4_Heter0.0001_100  Ploidy4_Heter0.001_100   Ploidy4_Heter0.01_100    Ploidy4_Heter0.1_100
5             15x            Ploidy5_Heter0.0001_15   Ploidy5_Heter0.001_15    Ploidy5_Heter0.01_15     Ploidy5_Heter0.1_15
5             25x            Ploidy5_Heter0.0001_25   Ploidy5_Heter0.001_25    Ploidy5_Heter0.01_25     Ploidy5_Heter0.1_25
5             50x            Ploidy5_Heter0.0001_50   Ploidy5_Heter0.001_50    Ploidy5_Heter0.01_50     Ploidy5_Heter0.1_50
5             100x           Ploidy5_Heter0.0001_100  Ploidy5_Heter0.001_100   Ploidy5_Heter0.01_100    Ploidy5_Heter0.1_100
6             15x            Ploidy6_Heter0.0001_15   Ploidy6_Heter0.001_15    Ploidy6_Heter0.01_15     Ploidy6_Heter0.1_15
6             25x            Ploidy6_Heter0.0001_25   Ploidy6_Heter0.001_25    Ploidy6_Heter0.01_25     Ploidy6_Heter0.1_25
6             50x            Ploidy6_Heter0.0001_50   Ploidy6_Heter0.001_50    Ploidy6_Heter0.01_50     Ploidy6_Heter0.1_50
6             100x           Ploidy6_Heter0.0001_100  Ploidy6_Heter0.001_100   Ploidy6_Heter0.01_100    Ploidy6_Heter0.1_100
7             15x            Ploidy7_Heter0.0001_15   Ploidy7_Heter0.001_15    Ploidy7_Heter0.01_15     Ploidy7_Heter0.1_15
7             25x            Ploidy7_Heter0.0001_25   Ploidy7_Heter0.001_25    Ploidy7_Heter0.01_25     Ploidy7_Heter0.1_25
7             50x            Ploidy7_Heter0.0001_50   Ploidy7_Heter0.001_50    Ploidy7_Heter0.01_50     Ploidy7_Heter0.1_50
7             100x           Ploidy7_Heter0.0001_100  Ploidy7_Heter0.001_100   Ploidy7_Heter0.01_100    Ploidy7_Heter0.1_100

Citing PloidyNGS

Renato Augusto Correa dos Santos and Diego Mauricio Riano Pachon. PloidyNGS: Visually exploring ploidy with Next Generation Sequencing Data (submitted to Bioinformatics Oxford in June/2016)
